[Kubernetes leader election] Run leader elector at all times #4542
Conversation
This pull request does not have a backport label. Could you fix it @constanca-m? 🙏
internal/pkg/composable/providers/kubernetesleaderelection/kubernetes_leaderelection.go (outdated review thread, resolved)
Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
…ernetes_leaderelection.go Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Requesting changes until we get an answer on whether this needs rate limiting and what the impact on k8s control plane traffic is.
The current implementation definitely needs to change, but it's not clear to me that this change won't still make rapid calls to acquire leases as fast as possible.
return comm.Err()

for {
	le.Run(ctx)
If there is a cluster error resulting in the leader continuously losing the lease, will this result in attempts to acquire it as quickly as possible with no rate limit?
The implementation of Run I see is:
// Run starts the leader election loop. Run will not return
// before leader election loop is stopped by ctx or it has
// stopped holding the leader lease
func (le *LeaderElector) Run(ctx context.Context) {
defer runtime.HandleCrash()
defer func() {
le.config.Callbacks.OnStoppedLeading()
}()
if !le.acquire(ctx) {
return // ctx signalled done
}
ctx, cancel := context.WithCancel(ctx)
defer cancel()
go le.config.Callbacks.OnStartedLeading(ctx)
le.renew(ctx)
}
There have been several escalations showing that failing to appropriately rate limit k8s control plane API calls like leader election can destabilize clusters.
What testing have we done to ensure this change won't cause issues like this?
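For illustration only, here is a minimal sketch of how the re-run loop could be throttled so a cluster error cannot turn it into an unbounded stream of lease requests. This is not the PR's actual code; the function name and the retryPeriod parameter are assumptions, and only the client-go call is real:

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
)

// runWithBackoff re-enters the election whenever Run returns (i.e. the lease
// was lost), but waits retryPeriod between attempts so the control plane is
// not flooded with acquire calls.
func runWithBackoff(ctx context.Context, le *leaderelection.LeaderElector, retryPeriod time.Duration) {
	for {
		le.Run(ctx) // blocks until the lease is lost or ctx is cancelled
		select {
		case <-ctx.Done():
			return
		case <-time.After(retryPeriod):
			// back off before trying to acquire the lease again
		}
	}
}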
We have already been doing this. This change will not affect the number of calls; it just makes sure that at least one agent will be reporting metrics. The problem with the current implementation is that run() goes like this:
- Keep trying to acquire the lease
- Acquire the lease
- Lose the lease and stop running
All the while, all the other instances were trying to acquire the lease already.
We do not have many SDHs on this bug, which leads me to believe that it is rare for an agent to lose the lease. But we do have problems when agents stop reporting metrics, and the only way we knew how to restart that was to make the pod run again, that is, force run() to run again.
There are some parameters in the config regarding the lease:
elastic-agent/internal/pkg/composable/providers/kubernetesleaderelection/config.go
Lines 17 to 19 in 43cb148

LeaseDuration int `config:"leader_leaseduration"`
RenewDeadline int `config:"leader_renewdeadline"`
RetryPeriod   int `config:"leader_retryperiod"`
These can be tuned if it becomes necessary to reduce how often an agent tries to acquire the lease.
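For context, a sketch of how integer-second fields like these typically feed client-go's leader election settings. Only the three duration fields come from the config above; the function name, the lock parameter, and everything else are illustrative assumptions:

import (
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// buildElectionConfig converts integer seconds into the time.Duration values
// client-go expects. lock would be a resourcelock.Interface (e.g. a Lease
// lock) built elsewhere.
func buildElectionConfig(lock resourcelock.Interface, leaseDuration, renewDeadline, retryPeriod int) leaderelection.LeaderElectionConfig {
	return leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: time.Duration(leaseDuration) * time.Second,
		RenewDeadline: time.Duration(renewDeadline) * time.Second,
		RetryPeriod:   time.Duration(retryPeriod) * time.Second,
	}
}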
Before I remove my requested changes, do we have a way to test this in this repository? It looks like you'd need a test showing that you can get the lease again after you've lost it. I see it is possible to inject the client used in the lock implementation, which takes an interface: Line 80 in 09b7410
I also see a fake client is used in the secrets tests: elastic-agent/internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets_test.go Line 36 in 09b7410
Is it possible to mock out the lease client to test this?
For the beats PR, I did add a unit test to check that. However, we had a way to change the start leading and stop leading functions there, and in this repository I cannot get inside those functions to check if that is working. Because of that, I don't think it is possible to test. There is a way to force the lease to be updated, and sometimes the leader changes. The problem is that we are working with seconds in the lease duration fields, so if I push a test for that, it might take 2 min; is that a good idea? I cannot go below seconds for those fields, unlike in the beats PR, because these fields are integers in seconds, and changing them to time.Duration would be a breaking change. @cmacknz
Can't you allow access to the lease callbacks you need through an internal constructor, so that the test can hook into them?
As above, you could bypass the Config object. It seems possible to test this here, although it does look like it'll be more work to write than it was in Beats. The upstream tests are also an interesting source of inspiration, https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection_test.go, but they test at a lower level and use a lot more test machinery from their own packages.
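As a rough sketch of the fake-client idea (the function name, lease name parameter, and namespace are assumptions, not the final test), the coordination Lease object can be rewritten through k8s.io/client-go/kubernetes/fake so the current leader appears to have lost it:

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

// forceNewHolder rewrites the lease so it looks like another candidate took
// over, which should make the previous leader lose it and re-enter the
// election loop.
func forceNewHolder(ctx context.Context, client *fake.Clientset, namespace, leaseName, newHolder string) error {
	lease, err := client.CoordinationV1().Leases(namespace).Get(ctx, leaseName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	now := metav1.NewMicroTime(time.Now())
	lease.Spec.HolderIdentity = &newHolder
	lease.Spec.AcquireTime = &now
	lease.Spec.RenewTime = &now
	_, err = client.CoordinationV1().Leases(namespace).Update(ctx, lease, metav1.UpdateOptions{})
	return err
}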
You made the requested changes I asked for and those changes look good.
I agree with Craig on all points about adding unit tests to cover this code.
I added a unit test @blakerouse @cmacknz. The test does the following:
select {
case <-done:
case <-time.After(time.Duration(leaseDuration+leaseRetryPeriod) * 20 * time.Second):
Technically, this timeout seems possible to hit...
There are only two candidates to acquire the lease, agent1 and agent2. In this case, we need to wait for agent2 to acquire it. Since I am forcing a new leader, this should imply agent1 losing the lease, which also means the leader elector needs to start again. This should be enough time for agent2 to acquire it.
However, I still added the timeout to make sure this test does not run for very, very long. Should I remove it?
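For a sense of scale (these values are hypothetical, not the test's actual settings): with leaseDuration = 15 and leaseRetryPeriod = 2 seconds, the guard above works out to

// (15 + 2) * 20 seconds = 340 seconds, i.e. the select could wait well over
// five minutes before the timeout case fires.
timeout := time.Duration(15+2) * 20 * time.Second // 5m40s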
Windows unit tests hit the timeout, so I removed it. This unit test might take a while to run.
internal/pkg/composable/providers/kubernetesleaderelection/kubernetes_leaderelection_test.go (outdated review thread, resolved)
/test
internal/pkg/composable/providers/kubernetesleaderelection/kubernetes_leaderelection_test.go (review thread, resolved)
getK8sClientFunc = func(kubeconfig string, opt autodiscoverK8s.KubeClientOptions) (kubernetes.Interface, error) {
	return client, nil
}
What would be nice would be setting the getK8sClientFunc back to the original after the test. Something like:
defer func() {
	getK8sClientFunc = defaultGetK8sClientFunc
}()
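An equivalent option, reusing the names from the quoted test (so not self-contained here) and assuming *testing.T is in scope, is to register the restore with t.Cleanup:

original := getK8sClientFunc
t.Cleanup(func() { getK8sClientFunc = original })
getK8sClientFunc = func(kubeconfig string, opt autodiscoverK8s.KubeClientOptions) (kubernetes.Interface, error) {
	return client, nil
}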
internal/pkg/composable/providers/kubernetesleaderelection/kubernetes_leaderelection_test.go (review thread, resolved)
// We need to wait for the first agent to acquire the lease, so we can change the POD_NAME environment variable again
expectedLeader := leaderElectorPrefix + podNames[i]
for {
What happens here if holder == expectedLeader never becomes true? It seems this will loop forever.
It might be better to switch to require.Eventually with a timeout, just to ensure that a break somewhere in the code doesn't leave this blocking the unit test forever.
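A sketch of that suggestion; getLeaseHolder is a hypothetical helper for reading the current holder from the fake client, and the timeout and tick values are placeholders:

require.Eventually(t, func() bool {
	holder, err := getLeaseHolder(ctx, client) // hypothetical helper
	return err == nil && holder == expectedLeader
}, 2*time.Minute, time.Second, "expected %s to become the lease holder", expectedLeader)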
I added a timeout. I explained in this comment why I removed it before #4542 (comment).
internal/pkg/composable/providers/kubernetesleaderelection/kubernetes_leaderelection_test.go (further review threads, resolved)
require.NoError(t, err)

// In this case, we already have an agent as holder
if currentHolder == leaderElectorPrefix+podNames[0] || currentHolder == leaderElectorPrefix+podNames[1] {
I am not adding a timeout to this for loop because it is guaranteed to complete, since there are only 2 pods running to acquire the lease.
Thanks for the test!
internal/pkg/composable/providers/kubernetesleaderelection/kubernetes_leaderelection_test.go (outdated review thread, resolved)
Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
…ernetes_leaderelection_test.go Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
Added the backport-v8.14.0 label as this is a bug fix. Feel free to remove it if you do not want this in 8.14.0.
Looks good.
* run leader elector at all times
---------
Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
(cherry picked from commit d48ac97)
What does this PR do?
This PR covers the issue elastic/beats#38543.
The fix was already merged for beats: elastic/beats#38471.
Now we need to align the agent with the same logic as well.
We do that by making sure the leader elector is always running. Previously, when an instance lost the lease, the leader elector would finish running, and it would not start again.
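A minimal before/after sketch of that behaviour (illustrative only; the real provider code wires the elector through its config and callbacks):

// Before: once the lease was lost, Run returned and was never called again,
// so this agent could never become leader and report cluster metrics again.
le.Run(ctx)

// After: keep the elector running so the agent can re-acquire the lease later.
for ctx.Err() == nil {
	le.Run(ctx)
}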
Checklist
./changelog/fragments using the changelog tool
Related issues
How to test
Results
The following test was done to make sure the leader elector runs again after losing the lease:
Create a 2-node cluster, so there are two agents running:
Change the lease, so that the first leader loses it.
First leader:
Change the lease holder. You can do it like this.
After a while, check the new leader:
As we can see, the same agent became leader again, which shows the leader elector kept running after losing the lease.